Correcting the Document Layout: A Machine Learning Approach

Authors

Donato Malerba

Floriana Esposito

Oronzo Altamura

Michelangelo Ceci

Margherita Berardi

Abstract

In this paper, we propose a machine learning approach that supports the user in correcting the results of layout analysis. Layout analysis is the process of extracting a hierarchical structure that describes the layout of a page. In our approach, layout analysis is performed in two steps: first, the global analysis determines possible areas containing paragraphs, sections, columns, figures and tables; second, the local analysis groups together blocks that possibly fall within the same area. The result of the local analysis strongly depends on the quality of the results of the first step. We investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the sequence of user actions. Experimental results on a set of multi-page documents are reported and discussed.

1. Background and motivation

Strategies for extracting the layout structure have traditionally been classified as top-down or bottom-up [10]. In top-down methods, the document image is repeatedly decomposed into smaller and smaller components, while in bottom-up methods, basic layout components are extracted from bitmaps and then grouped together into larger blocks on the basis of their characteristics. In WISDOM++, a document image analysis system that can transform paper documents into XML format [1], the applied page decomposition method is hybrid, since it combines a top-down approach to segment the document image with a bottom-up layout analysis method to assemble basic blocks into frames.

Some attempts to learn the layout structure from a set of training examples have also been reported in the literature [2,3,4,7,11]. They are based on ad-hoc learning algorithms, which learn particular data structures, such as geometric trees and tree grammars.
Results are promising, although it has been shown that good layout structures can also be obtained by exploiting generic knowledge on typographic conventions [5]. This is the case of WISDOM++, which analyzes the layout in two steps:

1. A global analysis, in order to determine possible areas containing paragraphs, sections, columns, figures and tables. This step is based on an iterative process, in which the vertical and horizontal histograms of text blocks are alternately analyzed, in order to detect columns and sections/paragraphs, respectively.
2. A local analysis, to group together blocks that possibly fall within the same area. Generic knowledge on Western-style typesetting conventions is exploited to group blocks together, such as “the first line of a paragraph can be indented” and “in a justified text, the last line of a paragraph can be shorter than the previous one”.

Experimental results proved the effectiveness of this knowledge-based approach on images of the first page of papers published in conference proceedings and journals [1]. However, performance degrades when the system is tested on intermediate pages of multi-page articles, where the structure is much more variable, due to the presence of formulae, images, and drawings that can stretch over more than one column or lie quite close to one another. The majority of errors made by the layout analysis module were in the global analysis step, while the local analysis step performed satisfactorily when the result of the global analysis was correct.

In this paper, we investigate the possibility of supporting the user during the correction of the results of the global analysis. This is done by allowing the user to correct the results of the global analysis and then by learning rules for layout correction from the user's sequence of actions. This approach is different from those that learn the layout structure from scratch, since we try to correct the result of a global analysis returned by a bottom-up algorithm.
Furthermore, we intend to capture knowledge on the correcting actions performed by the user of the document image processing system. Other document processing systems allow users to correct the result of the layout analysis; nevertheless, WISDOM++ is the only one that tries to learn correcting actions from user interaction with the system.

Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR 2003) 0-7695-1960-1/03 $17.00 © 2003 IEEE

In the following section, we describe the layout correction operations. The automated generation of training examples is explained in Section 3. Section 4 introduces the learning strategy, while Section 5 presents some experimental results.

2. Correcting the layout

Global analysis aims at determining the general layout structure of a page and operates on a tree-based representation of nested columns and sections. The levels of columns and sections alternate (Figure 1), which means that a column contains sections, while a section contains columns. At the end of the global analysis, the user can only see the sections and columns that have been considered atomic, that is, not subject to further decomposition (Figure 2). The user can correct this result by means of three different operations:

• Horizontal splitting: a column/section is cut horizontally.
• Vertical splitting: a column/section is cut vertically.
• Grouping: two sections/columns are merged together.

The cut point in the two splitting operations is automatically determined by computing either the horizontal or the vertical histogram on the basic blocks returned by the segmentation algorithm. The horizontal (vertical) cut point corresponds to the largest gap between two consecutive bins in the horizontal (vertical) histogram. Therefore, splitting operations can be described by means of a unary function, split(X), where X represents the column/section to be split and the range is the set {horizontal, vertical, no_split}.
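As a rough illustration of how such a cut point could be chosen, the following Python sketch looks for the largest empty gap along one axis. It is not the WISDOM++ implementation: it works directly on block extents rather than on a binned histogram, it assumes that a horizontal split cuts along the vertical axis (and vice versa), and it places the cut in the middle of the gap. The function name largest_gap_cut and the (left, top, right, bottom) tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical block representation: (left, top, right, bottom) in pixels.
Block = Tuple[int, int, int, int]


def largest_gap_cut(blocks: List[Block], direction: str) -> Optional[float]:
    """Return the coordinate of the widest empty gap, or None if there is none.

    Assumption: a "horizontal" split is a horizontal line, so the gap is
    searched along the y axis; a "vertical" split searches along the x axis.
    """
    if direction == "horizontal":
        spans = sorted((top, bottom) for _, top, _, bottom in blocks)
    elif direction == "vertical":
        spans = sorted((left, right) for left, _, right, _ in blocks)
    else:
        raise ValueError("direction must be 'horizontal' or 'vertical'")
    if not spans:
        return None

    best_gap, cut = 0, None
    covered_end = spans[0][1]  # farthest coordinate covered so far
    for start, end in spans[1:]:
        gap = start - covered_end
        if gap > best_gap:
            # Assumption: cut in the middle of the empty interval.
            best_gap, cut = gap, (covered_end + start) / 2
        covered_end = max(covered_end, end)
    return cut


blocks = [(0, 0, 100, 10), (0, 12, 100, 22), (0, 60, 100, 70)]
print(largest_gap_cut(blocks, "horizontal"))
print(largest_gap_cut(blocks, "vertical"))
```

In this sketch, a return value of None plays the role of no_split for the given direction, since the blocks leave no empty interval to cut through.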
The grouping operation, which can be described by means of a binary predicate group(A,B), is applicable to two sections (columns) A and B and returns a new section (column) C, whose boundary is determined as follows. Let (left_X, top_X) and (right_X, bottom_X) be the coordinates of the top-left and bottom-right vertices of a column/section X, respectively. Then:

left_C = min(left_A, left_B)
right_C = max(right_A, right_B)
top_C = min(top_A, top_B)
bottom_C = max(bottom_A, bottom_B)

Grouping is possible only if the following two conditions are satisfied:

1. C does not overlap another section (column) in the document.
2. A and B are nested in the same column (section).

After each splitting/grouping operation, WISDOM++ recomputes the result of the local analysis process, so that the user can immediately perceive the final effect of the requested correction and can decide whether to confirm the correction or not.

3. Representing corrections

From the user interaction, WISDOM++ implicitly generates some training observations describing when and how the user intended to correct the result of the global analysis. These training observations are used to learn rules for correcting the result of the global analysis, as explained in the next section.

The simplest representation describes, for each training observation, the page layout at the i-th correction step and the correcting operation performed by the user on that layout. Therefore, if the user performs n-1 correcting operations, n observations are generated. The last one corresponds to the page layout accepted by the user. In the learning phase, this representation may lead the system to generate rules that strictly follow the exact correction sequence chosen by the user. However, several alternative correction sequences that lead to the same result may also be possible. If they are not considered, the learning strategy will suffer from overfitting. This issue was already discussed in a preliminary work [9].
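The grouping rule of Section 2 and the simplest representation described above can be sketched together in Python. This is only an illustration under stated assumptions, not the WISDOM++ code: the Area class, its parent field, the overlap test on rectangle interiors, the restriction to grouping operations, and the "accept" label for the final observation are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Area:
    """A column or section; `parent` is a hypothetical id of the enclosing area."""
    name: str
    left: int
    top: int
    right: int
    bottom: int
    parent: str


def overlaps(a: Area, b: Area) -> bool:
    """True if the two rectangles share interior points (assumed overlap test)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def group(a: Area, b: Area, layout: List[Area]) -> Optional[Area]:
    """Merge a and b into C, or return None if a grouping condition fails."""
    if a.parent != b.parent:  # condition 2: nested in the same area
        return None
    c = Area(a.name + "+" + b.name,
             min(a.left, b.left), min(a.top, b.top),
             max(a.right, b.right), max(a.bottom, b.bottom), a.parent)
    others = [x for x in layout if x not in (a, b)]
    if any(overlaps(c, x) for x in others):  # condition 1: no overlap
        return None
    return c


def record(layout: List[Area],
           merges: List[Tuple[str, str]]) -> List[Tuple[List[Area], str]]:
    """Pair each intermediate layout with the operation applied to it.

    n-1 applicable operations yield n observations; the last one holds the
    accepted layout (labelled "accept" here, an assumed convention).
    """
    observations = []
    for name_a, name_b in merges:
        by_name = {x.name: x for x in layout}
        a, b = by_name[name_a], by_name[name_b]
        c = group(a, b, layout)
        if c is None:
            continue  # inapplicable request: layout unchanged, nothing recorded
        observations.append((list(layout), f"group({name_a},{name_b})"))
        layout = [x for x in layout if x not in (a, b)] + [c]
    observations.append((list(layout), "accept"))
    return observations


s1 = Area("s1", 0, 0, 100, 40, "col1")
s2 = Area("s2", 0, 50, 100, 90, "col1")
s3 = Area("s3", 0, 100, 100, 140, "col1")
for snapshot, operation in record([s1, s2, s3], [("s1", "s2")]):
    print(operation, [x.name for x in snapshot])
```

Because each observation stores only the one sequence the user actually followed, the sketch also makes the overfitting concern concrete: a different but equivalent order of merges would produce a different list of observations for the same final layout.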
A more sophisticated representation, which takes into account alternative correction sequences, is based on the Column level

Related resources

Correcting Keyboard Layout Errors and Homoglyphs in Queries

Keyboard layout errors and homoglyphs in cross-language queries impact our ability to correctly interpret user information needs and offer relevant results. We present a machine learning approach to correcting these errors, based largely on character-level n-gram features. We demonstrate superior performance over rule-based methods, as well as a significant reduction in the number of queries th...

Document Content Layout Based Exploit Protections

Malware laden documents are a common exploit vector, especially in targeted attacks. Most current approaches seek to detect the malicious attributes of documents whether through signature matching, dynamic analysis, or machine learning. We take a different approach: we perform transformations on documents that render exploits inoperable while maintaining the visual interpretation of the documen...

Machine Learning for Reading Order Detection in Document Image Understanding

Document image understanding refers to logical and semantic analysis of document images in order to extract information understandable to humans and codify it into machine-readable form. Most of the studies on document image understanding have targeted the specific problem of associating layout components with logical labels, while less attention has been paid to the problem of extracting relat...

Machine Reliability in a Dynamic Cellular Manufacturing System: A Comprehensive Approach to a Cell Layout Problem

The fundamental function of a cellular manufacturing system (CMS) is based on definition and recognition of a type of similarity among parts that should be produced in a planning period. Cell formation (CF) and cell layout design are two important steps in implementation of the CMS. This paper represents a new nonlinear mathematical programming model for dynamic cell formation that employs the ...

An Integrated Approach for Automatic Semantic Structure Extraction in Document Images

In this paper we present an integrated approach for semantic structure extraction in document images. Document images are initially processed to extract both their layout and logical structures on the base of geometrical and spatial information. Then, textual content of logical components is employed for automatic semantic labeling of layout structures. To support the whole process different ma...

Publication date: 2003
